-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Make dx
histogram behavior consistent with px
#1002
base: main
Are you sure you want to change the base?
Conversation
@@ -47,6 +47,27 @@ hist_3_bins = dx.histogram(setosa, x="SepalLength", nbins=3) | |||
hist_8_bins = dx.histogram(setosa, x="SepalLength", nbins=8) | |||
``` | |||
|
|||
### Bin and aggregate on different columns | |||
|
|||
If both `x` and `y` are specified, the histogram will be binned across one column and aggregated on the other. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would mention something like "If the plot orientation is vertical, the x
column will be binned and the y
column aggregated. The operations are flipped if the plot orientation is horizontal."
I don't know which orientation corresponds with which pairing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
# only one should be set | ||
if hist_agg_label_h: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be a bit clearer to make the args hist_agg_label
and hist_orientation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Column values must be numeric. If x is specified, | ||
the bars are drawn vertically by default. | ||
y: A column name or list of columns that contain y-axis values. | ||
Only one of x or y can be specified. If y is specified, the | ||
bars are drawn vertically. | ||
Column values must be numeric. If only y is specified, | ||
the bars are drawn horizontally by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just making sure this has different behavior from bar in that the columns must be numeric. From the bar orientation docstring it looks like 1 value can be non-numeric
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's intentional. Our histogram never allows non-numeric data. Our bar does. It's worth noting that the logic in calculate_bar_orientation
isn't called by bar when both x
and y
are specified, only when x
or y
is (as that is the logic that was formally known as frequency_bar
) so the orientation
in that case is set by the wrapped px.bar
function.
relabeled_agg_col = ( | ||
labels.get(self.agg_col, self.agg_col) if labels else self.agg_col | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the ternary necessary here? Since there's a default empty dict for labels
, I think you can just get rid of the ternary since the else
should never hit? I guess unless a user specified labels=None
maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's for when labels=None
elif self.histnorm == "probability": | ||
hist_agg_label = f"fraction of sum of {hist_agg_label}" | ||
elif self.histnorm == "percent": | ||
hist_agg_label = f"percent of sum of {hist_agg_label}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these X of sum of Y
? Looks like this is the case where histfunc
is not count or sum. So is there a histfunc
that should be required for these cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how plotly labels it https://github.com/plotly/plotly.py/blob/master/packages/python/plotly/plotly/express/_core.py#L239
More specifically though, hist_agg_label
would contain histfunc
at this point.
class UnivariateAwarePreprocessor: | ||
""" | ||
A preprocessor that stores useful args for plots where possibly one of x or y or both can be specified, | ||
which impacts the orientation of the plot in ways that affect the preprocessing. | ||
Should be inherited from. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be abstract?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pivot_vars: The vars with new column names if a list was passed in | ||
list_var: The var that was passed in as a list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused by what these do. Is pivot_vars
column renames? Is list_var
supposed to be something like x
or y
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pivot_vars
is kind of a rename - it's for when you pass in a list so the tables need to be stacked with resulting value
and variable
columns, although they may or may not be named exactly that if those columns already exist in the table. It's not a great name for that dict. I think I'll go ahead and refactor it as part of this.
list_var
would be x
or y
yes, whichever parameter was a list originally
if self.args.get(self.agg_var): | ||
self.agg_col: str = ( | ||
pivot_vars["value"] | ||
if pivot_vars and list_var and list_var == self.agg_var | ||
else args[self.agg_var] | ||
) | ||
else: | ||
# if bar_var is not set, the value column is the same as the axis column | ||
# because both the axis bins and value are computed from the same inputs | ||
self.agg_col = self.bin_col |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the comment in the else say agg_var
? The if only checks agg_var
, not bar_var
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Fixes: #668
The following examples are now basically consistent.
During testing I discovered #1001 so the legend titles are not set in a couple of these, but that is beyond the scope of this ticket as that is a general plot by issue, not a histogram specific one.
Note a few intentional differences in the args. It's not expected that
dx
histograms andpx
histograms aggregate in the same exact way. Plotly infers the equivalent ofrange_bins
hence the extra argument. Additionally we havebarmode="group"
as our default whereas in Plotly it isbarmode="relative"
. I believe this was an intentional choice on our end because it's hard to compare the different traces withbarmode="relative"
although I don't have a record of it.